ABSTRACT

The British Education Act of 1944 stipulated that
instruction and training be offered according to the ages, abilities,
and aptitudes of pupils. One specific problem concerned the entry to
secondary schools of pupils from a variety of primary schools. The
resulting problem of determining the different aptitudes and
abilities has been partially solved by the use of standardized tests.
This pamphlet is designed to provide a brief introduction to the
methods of constructing and using standardized tests, and to discuss
special difficulties encountered in the construction and use of
standardized tests in Wales, a mixed language area. Specifically
discussed are the various kinds of standardized tests, what is meant
by standardization, what such tests determine, the choice of tests,
comparison of the results of various tests, and the concepts of
mental age, attainment ages and quotients. (CLK)
FOREWORD

There can be no doubt that changes in the organisation of
educational administration subsequent to the Education Act of
1944 will place greater emphasis on educational guidance, and will
render more important the use of standardised tests as a means of
diagnosis preparatory to remedial treatment.

That being so, it seems desirable that teachers shall be familiar
with the principles underlying the construction and use of these
tests.



At the present time, there are few accounts available in the region
between mathematical and statistical formulae which frighten the
layman out of his wits, and easy-going general accounts of tests
and testing which, by omitting all the technical details of
standardising, make the process appear to be misleadingly easy.

If standardised tests are to be used in the classrooms in the
ordinary course of educational guidance then the teachers who use
them should be familiar with such details as are necessary for
correct use of the tests and intelligent interpretation of the results.

In particular, so far as Wales is concerned, there does not appear
to exist at the moment any account which deals with the special
problems of the construction and use of standardised tests in a
mixed language area.

This pamphlet has been written in the hope that it may fill the
gaps indicated above.
"Intelligence tests are commonly criticised, most commonly
by persons who have little understanding of the way in which they
are used. They are blamed for failing to measure things which they
are not intended to measure."

Macrae. Talents and Temperaments.

"A testing programme . . . must be planned for a particular
purpose and to suit the needs of particular groups of children.
The pupils and not the educational institution or system should
be the main consideration in framing such a programme. There is
always the danger that extensive testing, particularly where results
are not used and interpreted effectively, can lead to rigidity and
even sterility in teaching because measurements tend to assume
some value in themselves, when, in fact, they have only relative
value in the light of the action which follows."

"On the other hand teaching without use of carefully planned
testing may result in a good deal of misplaced effort on the part
of teachers, and failure and frustration on the part of pupils. Not
a little of this loss of achievement and frustration on the part of
both teachers and taught could be avoided if objective measures
of appraisal, diagnosis and checking were used. A knowledge of
standardised tests of mental ability and achievement should be part
of an adequately trained teacher's classroom equipment."

Schonell. Diagnostic and Attainment Testing.

"Tests have their limitations as well as their values and one
should know what points to observe in order to avoid pitfalls of
over-evaluation or incorrect interpretation . . . One should not
place too much reliance on an isolated test finding."

Schonell. Diagnostic and Attainment Testing.
WHAT THIS PAMPHLET IS ABOUT

Dissatisfaction with Scholastic Examinations

Scholastic examinations have had a very long history. They have
been used by people as far apart in time and conditions as the
ancient Chinese and modern British.

Since 1902, when County Councils in England and Wales were
made responsible for secondary education, admission to grammar
schools at age eleven-plus has depended increasingly on competitive
scholastic examinations. At the same time access to grammar
schools has become, economically and socially, increasingly
important.

It is not surprising, therefore, that scholastic examinations have
become very much a matter of public concern. They have been
subject to critical scrutiny by educationists, psychologists and
statisticians with, in some cases, rather startling results.

If examinations are to be used for selecting candidates for higher
education it is essential that they shall be fair and reliable. The
examination of examinations, as it has been called, has aroused
serious doubts about the reliability of the academic essay type
of examination, particularly for predicting future educational
progress and ultimate level of scholastic attainment.

These doubts have stimulated a more careful analysis of the
examination procedure itself as well as of the results it was
supposed to achieve.

For our purpose, perhaps the gravest defect of the traditional
scholastic examinations was their failure to distinguish reliably
between present attainment and probable future performance.
Properly conducted, these examinations do indicate that the
successful candidates have already attained a certain level of skill
in writing answers in the form of essays, and have acquired certain
types of usable information. However, follow-up studies of
grammar-entrance examinations at age eleven-plus have shown,
beyond doubt, that as instruments for predicting future academic
successes they are by no means reliable. Yet when used as
competitive tests for grammar-school selection or university entrance,
the predictive function of the examinations is more important than
the measure of present attainment. What matters in those cases is
not merely how much the candidate has been coached to learn up
to the present moment, but rather how rapid will be his progress
in the next few years and how high his ultimate achievement.

Comparative studies have revealed serious defects in the
examination procedure itself: variations in the standards and methods of
marking, for example. In addition, it has been suggested that
success depended to a significant extent on conditions which were
not at all equal for all candidates. Good school conditions, efficient
teaching, regular attendance, unbroken schooling, good health,
good homes, a good memory, speed of writing, all favour success.
On the other hand, bad teaching, poor school conditions, ill-health,
frequent changes of school, poor homes, emotional maladjustment,
all hinder a candidate from rising to the level which his real
intellectual ability would indicate.



Search for more Reliable Tests.

Consequently, since the beginning of the present century, when
the selective indications of scholastic examinations increased in
economic as well as academic importance, we have seen a persistent
search for more reliable tests of intellectual aptitude and future
scholastic attainment.

This search has led, among other results, to the adoption and
refinement of objective standardised tests as supplements of, or
alternatives to, traditional scholastic examinations. Professor Sir
Godfrey Thomson, one of the prime movers in this search, said
in an address to the National Foundation for Educational Research
that he felt that "he had a moral duty to do everything possible to
improve methods of discovering intelligent children who might be
overlooked, and guiding them into forms of higher education likely
both to make them happier in their lot and useful to a society and
civilisation which needs them."*

* Bulletin of the National Foundation for Educational Research, No. 2,
November, 1953.

Misgivings about Standardised Tests.

However, the use of standardised tests, of "intelligence"
particularly, to select candidates at age eleven-plus for admission to
grammar schools, has aroused as much criticism and opposition in
the lay public as the unreliability of essay-type examinations
aroused in the experts. The "intelligence" tests are viewed with
marked suspicion which has been exploited by "sob-stuff" writers
in certain newspapers and magazines, usually people with little or
no real knowledge of the methods by which these tests are prepared
and validated. Against this, there is now much experimental
evidence that, although by no means perfect, correctly standardised
tests of "intelligence" are more reliable predictors of general
success than traditional essay-type examinations. Suspicion in the
minds of parents and the general public has arisen mainly from
the fact that these tests would appear to have been used solely
for the purpose of excluding all but a small proportion of children
of eleven-plus from the secondary grammar schools.

This, however, is not a valid criticism of the tests. It is rather
an indication of the unsatisfactory nature of our educational
system. The effective answer is not to abolish standardised tests but
to provide more and more adequate secondary school accommodation.

Reasons Why Standardised Tests are Likely to be
Increasingly Used.

This unfortunate association of tests of "intelligence" with
selection for grammar school places has diverted attention from
other important functions in educational practice which standardised
tests can fulfil without incurring the suspicions we have noted.
There are indications at the present time that even if examinations
for grammar school selection should be abolished entirely, the
need for standardised tests of attainment as well as intelligence
will become increasingly important and increasingly widespread
in the future.

Implementing the Education Act of 1944.

It is not yet realised explicitly, even in educational circles, how
revolutionary in relation to traditional English notions was the
Education Act of 1944. This enacts that each Local Education
Authority must provide such educational facilities as will afford

"for all pupils such opportunities for education offering such
variety of instruction and training as may be desirable in view of
their different ages, abilities, and aptitudes and of the different
periods for which they may be expected to remain at school,
including practical instruction and training appropriate to their
respective needs."

This provision of 'varied instruction and training in view of
different ages, abilities and aptitudes' implies two processes:
(a) providing school buildings and material facilities for the varied
types of training, and (b) devising methods of discovering what
different abilities and aptitudes do, in fact, exist and what types
of training are most suitable for their nurture, or, in other words,
organising processes of educational guidance. How can item (b)
be accomplished reliably?




Human institutions tend to be self-perpetuating. They persist
with a perverse tenacity long after the conditions which brought
them into existence originally have ceased to be important. This
is certainly true of English education. That has been organised for
centuries on the assumptions that the only type of education which
mattered was provided by the classical grammar schools, and the
only type of ability and aptitude worth serious nurture was that
which thrived on the classical grammar school curriculum. It was
most unfortunate that just before the passing of the 1944 Education
Act these traditional attitudes were embodied in the psychological
myths of the Norwood Report which was regarded by people
already steeped in the tradition as a form of Holy Writ.

The challenge of the 1944 Act needs to be taken up. Are there,
in fact, abilities and aptitudes other than the linguistic, abstract,
intellectual? If so, what are they; what is their educational
importance for both the community and the individual in the modern
world; at what age do they appear; and how can they be detected
and trained to full maturity? This is the problem of educational
guidance in which standardised tests are likely to play an
increasingly important part.



The Situation in Comprehensive, Multilateral and
Bilateral Secondary Schools.

Some Education Authorities are making experiments with
comprehensive or 'multilateral' or 'bilateral' secondary schools.
In these cases, all the pupils in an administrative area who are
above the level of the just-not-certifiably-feeble-minded go into
the same secondary school. This practice obviates the necessity for
a selection examination at eleven-plus, but the staff of any
comprehensive secondary school will still be faced with the need to
sort out pupils into kinds and grades of ability and attainment and
to adapt both curriculum and methods of teaching to pupils'
varying needs. National welfare as well as individual educational
progress demands, nowadays, that the best possible use shall be
made of all our available ability at whatever level it is manifested.
Moreover it is absurd to teach the highly gifted in the same way
and at the same pace as those of average or lower than average
mental aptitude, apart altogether from the adequate treatment of
different kinds of aptitudes.

Again, it is becoming increasingly popular to suggest that any
sorting process necessary on entry to secondary education can best
be done on the basis of cumulative records of the progress of
individual pupils throughout their primary school careers. However,
this alternative introduces its own particular difficulties. Pupils
entering any one secondary school will be recruited from several
different primary schools. How then can the estimates of aptitudes
and attainment made by different teachers in different schools be
compared fairly? There is no lack of objective evidence that the
judgments of different teachers even in the same school vary
significantly. Therefore, if pupils on entry to secondary schools are to
be sorted out reliably on the basis of records compiled in different
primary schools there must be some guarantee that the records in
question are all expressed in terms of one standard scale. As we
shall see later, a standardised test of "intelligence" or attainment
is itself a standard scale. Therefore all records of aptitude or
attainment obtained from the same standardised test correctly
administered according to the prescribed instructions can be
compared fairly, even though the pupils concerned come from
different schools.



Backwardness and Educational Retardation.

Further, the introduction of secondary education for all according
to the Act of 1944 has revealed to the horrified gaze of secondary
school teachers the extent and difficulties of the problems of
backwardness.

Previously, when admission to secondary schools was determined
largely by success in a scholastic examination, secondary school
teachers could reasonably expect that pupils of eleven-plus should
have reached an accepted minimum standard of attainment in
language, reading and arithmetic.

This expectation is no longer reasonable. When all pupils of
eleven-plus proceed automatically to secondary schools, then the
secondary schools must perforce accept a wide range of aptitude
and attainment in their intake. At age eleven-plus there may be a
range of attainment as wide as from below 7 to 15 years. That
being so, it is useless for secondary school teachers to accuse their
primary school colleagues of inefficiency, or worse. Instead of
demanding a minimum level of scholastic attainment according to
traditional scholastic standards the only demand that can now be
made legitimately by the secondary school teachers is that the
pupils from the primary schools shall have
attainment in the three R's on a par with their educable capacity.
Thus, if a pupil of 11 years has a mental age of only 9 years, then
if his attainment ages in language, reading and arithmetic are
9 years, that pupil has been efficiently taught, although to the
secondary school teacher he may appear to be seriously backward.
Secondary schools in the new dispensation will have to learn not
to expect the impossible.

Moreover they themselves will have to deal efficiently with the
problems of the backward children. It is essential in the treatment
of a backward child to discover (a) whether the child is merely
retarded or (b) whether he is backward because of innate dullness.
In either case successful treatment will depend on adapting the
level of difficulty of his work to that of his real mental capacity.
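
The comparison described above can be set out as a short sketch. It is offered only as an illustration of the reasoning and not as a diagnostic procedure: the function name, the wording of the findings and the tolerance of half a year are all assumptions introduced here, and every age is taken in years.

```python
def describe_pupil(chronological_age, mental_age, attainment_age, tolerance=0.5):
    """Compare a pupil's attainment age with his mental age (all ages in years).

    The tolerance of half a year is an arbitrary assumption for illustration.
    """
    findings = []
    if attainment_age >= mental_age - tolerance:
        # Attainment keeps pace with mental age: the pupil has been efficiently taught.
        findings.append("attainment on a par with educable capacity")
    else:
        # Attainment falls short of mental age: the pupil is retarded.
        findings.append("retarded: attainment below mental age")
    if mental_age < chronological_age - tolerance:
        # Mental age falls short of chronological age.
        findings.append("backward for his age: mental age below chronological age")
    return findings


# The example from the text: aged 11, mental age 9, attainment age 9.
print(describe_pupil(11, 9, 9))
```

On the figures given in the text the sketch reports attainment on a par with capacity, together with a mental age below the chronological age, which is the case of the pupil who has been efficiently taught yet appears backward to the secondary school teacher.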

The most reliable way of estimating the mental and attainment
ages of a group of pupils is by the use of standardised tests.
They are not only more reliable predictors of general educational
capacity and future academic progress than the traditional
examinations; they are also much more sensitive indicators of scholastic
defect. An actual example will illustrate the value of standardised
tests for diagnostic purposes. A group of 111 pupils, aged eleven-plus,
the yearly intake of a secondary modern school, was tested
in vocabulary and reading comprehension with the following
results:*

Schonell Vocabulary Test.

Vocabulary Age   No. of Cases     Vocabulary Age   No. of Cases
     15                1               10               22
     14                3                9               16
     13                2                8                8
 12 (normal)           5            7 or less           33
     11               11
                                      Total            111

Schonell Test of Reading Comprehension.

 Reading Age     No. of Cases      Reading Age     No. of Cases
     15                0               10               20
     14                0                9               25
     13                2                8               21
 12 (normal)           1            7 or less           30
     11               12
                                      Total            111

* For these data I am indebted to Mr. E. S. Thomas, Pembroke Dock.

Here we find a secondary school intake all at the same
chronological age with a range of vocabulary ages from less than 7 up
to 15 years, and of reading ages from less than 7 up to 13 years.
Obviously this intake must be taught by methods and exercises
appropriate to their attainment ages. It is possible, nowadays, when
the attainment ages have been discovered by the use of standardised
tests, to consult schedules which indicate appropriate books
and exercises for given vocabulary and reading ages.† This example
will serve to show how much more definite and therefore practically
useful are the data revealed by the standardised test than
impressions gained by casual observation.
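
As a rough check on how definite these data are, the reading comprehension figures can be totalled in a few lines. The sketch below simply re-adds the numbers printed in the reading table; recording "7 or less" under the key 7 and the variable names are conveniences assumed here.

```python
# Reading ages and numbers of cases as printed in the reading comprehension table.
# The row "7 or less" is recorded under the key 7.
reading_cases = {15: 0, 14: 0, 13: 2, 12: 1, 11: 12, 10: 20, 9: 25, 8: 21, 7: 30}
NORMAL_AGE = 12  # the row marked "(normal)" in the table

total = sum(reading_cases.values())
below_normal = sum(n for age, n in reading_cases.items() if age < NORMAL_AGE)
two_or_more_years_below = sum(
    n for age, n in reading_cases.items() if age <= NORMAL_AGE - 2
)

print(total)                    # 111
print(below_normal)             # 108
print(two_or_more_years_below)  # 96
```

On these figures 108 of the 111 pupils have reading ages below the level marked normal, and 96 are two or more years below it.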

New conditions, even in educational administration and teaching
practice, demand new methods. The need for differentiation of
curricula in order to adapt them to the needs of pupils of different
levels of intellectual capacity and maturity, particularly at the
secondary stage, is likely to demand the application of standardised
tests to an increasing extent in the not-too-distant future. It seems
desirable, therefore, that some knowledge of the construction and
use of these tests should become an established part of every
teacher's training.

Special Problems in Wales.

So far as differentiation of schools and curricula and the special
problems of secondary education are concerned, administrators
and teachers in Wales are faced with difficulties of the same type
as their colleagues in England. However, in Wales there are special
problems in connection with the mixture of languages and the
implementation of a bilingual education policy. As we shall see
later this policy introduces problems the solution of which would
be helped very much by the use of standardised tests. At the same
time, the construction and standardisation of tests in Welsh, and
the use of standardised tests whether Welsh or English, involves
difficulties peculiar to Wales.

The object of this pamphlet is to provide a brief introduction
to modern methods of constructing and using standardised tests
and to discuss the special difficulties involved in this work in Wales.
The topics will be treated here only in sufficient detail to give
teachers and educational administrators some insight into the
nature, construction and use of these tests for practical purposes.
Readers who are interested can follow the topics into more detail
in the references indicated in the bibliography.

† See, for example, Schonell: Diagnostic and Attainment Testing (p. 162).

WHAT ARE STANDARDISED TESTS?

COMPARISON WITH SCHOLASTIC EXAMINATIONS 



An accepted canon of good teaching is to proceed from the
known to the unknown. We can, with profit, use the principle in
an approach to the discussion of standardised tests and begin with
something with which all teachers are only too familiar, the
traditional examination.

Any examination represents an attempt to take samples of a
candidate's total knowledge and skill. From the results of the four,
five, six, seven, or eight sample questions which the traditional
examination paper contains, the examiner judges whether or not
the whole of the candidate's knowledge or skill is satisfactory.

The process of examining is exactly similar in form to that used
by a buyer of some commodity in bulk. It is impossible to view,
in the time available, all the articles to be bought. Therefore the
buyer takes out samples here and there from the bulk, and makes
his decision about the probable quality of the whole consignment
on the quality of the small samples. Obviously, the accuracy of his
judgment will depend on the skill or the luck with which he chooses
his samples, and on the number of samples he takes. The fewer
the samples the more will luck and biased judgment interfere with
the verdict.
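
The effect of the number of samples can be illustrated by a small simulation. It is only a sketch under stated assumptions: the consignment of 1,000 articles, of which 700 are sound, is hypothetical, as are the function name and the sample sizes of 5 and 50. It draws many random samples of each size and measures how widely the buyer's estimates of quality scatter.

```python
import random
import statistics


def spread_of_estimates(consignment, sample_size, trials, rng):
    """Standard deviation of the estimated proportion of sound articles
    over repeated random samples of the given size."""
    estimates = []
    for _ in range(trials):
        sample = rng.sample(consignment, sample_size)
        estimates.append(sum(sample) / sample_size)
    return statistics.pstdev(estimates)


rng = random.Random(0)  # fixed seed so that the run can be repeated
# Hypothetical consignment: 1000 articles, 700 sound (1) and 300 faulty (0).
consignment = [1] * 700 + [0] * 300

spread_few = spread_of_estimates(consignment, sample_size=5, trials=2000, rng=rng)
spread_many = spread_of_estimates(consignment, sample_size=50, trials=2000, rng=rng)
print(spread_few, spread_many)
```

The scatter for samples of five should come out roughly three times as large as that for samples of fifty, which is the sense in which a paper of a few questions leaves more to luck than a test of many.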

Thus the first objection which can be urged against the traditional
examination is that it contained too few questions. It did not sample
the candidate's total knowledge or ability fairly. The element of
chance or 'luck' played too great a part in the verdict.

In the second place, in large scale examinations such as the
School Leaving Certificate,* the answers were marked by many
different examiners, each with his own particular (and often
peculiar) methods of marking and standards of assessment. (See
Fig. 1).

* Now General Certificate of Education.

Further, the relative difficulty of each question, and its value in
marks, was decided by the personal subjective judgment of whoever
set the questions without reference to the actual degree of
difficulty the questions might present to the candidates tested.

These difficulties might not have serious consequences if the
examinations were used in school as routine tests by teachers with
personal knowledge of the candidates. They were most serious,
however, when the examinations were used for competitive purposes,
e.g. selection for grammar school places, or admission to
university, or state scholarships.

Compared with traditional scholastic examinations, standardised
tests contain a large number of questions, each requiring a short,
unequivocal answer. In this way they sample the testee's abilities
and attainments much more thoroughly, thus reducing the element
of chance.

Writing is reduced to a minimum.

Administration of the tests is standardised. Each test must be
given strictly according to carefully prescribed instructions. These
ensure, as far as possible, uniform conditions for all candidates.

Marking the answers is standardised. Exact directions for
marking, and the mark values of each answer, are prescribed in
the manual of instructions.

Thus, the tests may be given and marked without varying the
conditions and standards of marking by any person capable of
reading and understanding the instructions, and honest enough to
obey them exactly.

Finally, the standardised test is always tried out, sometimes
on several occasions, before it is published, on a representative
population of pupils as similar as possible to those for whom the
test is intended, in order to discover the relative difficulty and
power of discrimination of the test questions, and their general
suitability for their purpose.

This process of preliminary trial and what is called item-analysis
will be described in more detail later.



COMMON TYPES OF STANDARDISED TESTS 



It is convenient at this stage to indicate the principal types of
tests in common use since the form of the test determines the
details of the processes of standardisation.

The first standardised tests were designed to be given to children
individually. This, however, is a lengthy process. A Binet test may
need anything from a half to one and a half hours to give. As the
use of standardised tests increased, other types were devised for
simultaneous administration to groups. The American Army Tests
during the 1914-19 war were given to as many as five hundred
men at a time.

In addition to the distinction between individual and group tests,
the tests vary also according to the purpose for which they are
intended.

Some are tests of "intelligence" or, more accurately, of intellectual
capacity or general educational aptitude. In these the questions
require, predominantly, the exercise of powers of observation,
reasoning and ingenuity, i.e. general abilities common to many
types of intellectual and practical activities. These tests are used
to indicate a pupil's probable rate of intellectual development, and
the upper level of attainment which, given favourable conditions,
he may reasonably be expected to achieve in the not too distant
future.

On the other hand, some standardised tests are intended to
discover a pupil's present level of scholastic attainment in, for
example, mechanical arithmetic, problem arithmetic, spelling,
word-recognition, vocabulary, reading-comprehension,
sentence-structure.

The use of verbal tests of educable capacity presents special
difficulties in cases of backward readers, and in mixed language
areas. The latter will be discussed in more detail later. For these
reasons, non-verbal and performance tests have been devised.

Non-verbal tests are "pencil and paper" tests using visual
forms of test material in which the use of words is reduced to a
minimum.

In performance tests, the pupils tested are required to perform
some practical activity such as fitting together jig-saw patterns, or
threading mazes.

Non-verbal and performance tests have the advantage that the
instructions for procedure can be given equally well in any
language. Dependence on particular word-habits is reduced to a
minimum even though it may not be eliminated completely.

This classification of standardised tests can be represented in
summary form as follows:

Individual tests (e.g. Terman-Binet or Terman-Merrill) to be
   used for testing one individual at a time.
Group tests, to be used for testing whole classes at the same
   time.
Both the above types may include:
   (a) Tests of "intelligence," that is of general educational
       capacity.
         (i) verbal
        (ii) non-verbal
       (iii) performance
   (b) Tests of attainment in particular scholastic subjects.

Certain tests for special abilities (e.g. musical, mechanical,
artistic, clerical) are now available. These are more important for
vocational selection and guidance and need not concern us here.

Two further distinctions should be noted. They are important
for our purpose for two reasons: (a) they imply differences in
methods of standardisation, (b) they need to be taken into account
when choosing tests for particular purposes and interpreting the
results.

No test, however accurately standardised, will give absolute
estimates. The test results must always be interpreted in relation
to the method of standardisation; the population used for
standardisation; the purpose of the test; and the population to be
tested. Standardised tests are measuring devices. As such they
must be used only in the conditions and for the objectives for
which they have been constructed. Only too often, tests are used
for some purpose for which they are not appropriate. Then if the
results differ from those which were expected all standardised tests
are damned without qualification. It is just as reasonable to quarrel
with a yard stick because it will not measure pounds or pints.

The first distinction is between tests which cover extended as
against restricted age-ranges.

The earliest tests of the Binet type covered an age-range between
five and fourteen years. The latest Terman-Merrill revision of the
Binet test extends from two to eighteen years. Other tests have
ranges of six to eleven years or seven to fourteen years, for example.

However, the practical use of these tests revealed statistical and
other difficulties, particularly at the extreme upper and lower ends
of the scale. Partly for this reason, and partly on account of the
increasing use of standardised tests for selecting candidates at age
eleven-plus for grammar school entrance, tests were standardised
for particular restricted age-ranges, e.g. 10 to 12 years, the limits
usually prescribed by Local Education Authorities for the grammar
school selection tests.

The restricted age-range tests have a further advantage. They
are more sensitive to small differences between the various
candidates within the group. This is a most important consideration in
the selection process at age eleven-plus.

The second distinction is that between age-scales and point-scales,
as they may be called.

In the age-scales, each test item is evaluated directly in terms of
mental or scholastic age.

In the point-scales, each test item is given a prescribed mark;
the total marks (or points) gained by the testee on the whole test
are tabulated, and these totals are then equated with mental or
scholastic ages.
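
The working of a point-scale can be sketched in a few lines. The conversion table below is invented purely for illustration and belongs to no published test; the function name and the marks awarded are likewise assumptions. The sketch totals the marks on the items and then looks up the mental age with which that total is equated.

```python
import bisect

# Hypothetical conversion table: the lowest total score that earns each mental age.
# These figures are invented for illustration and belong to no published test.
SCORE_THRESHOLDS = [0, 12, 20, 29, 37, 44]
MENTAL_AGES = [7, 8, 9, 10, 11, 12]


def mental_age_from_points(item_marks):
    """Total the marks on all items, then equate the total with a mental age."""
    total = sum(item_marks)
    # Index of the last threshold that does not exceed the total.
    position = bisect.bisect_right(SCORE_THRESHOLDS, total) - 1
    return total, MENTAL_AGES[position]


# Twenty-five items with the pattern of marks 1, 1, 0, 2, 1 repeated five times.
print(mental_age_from_points([1, 1, 0, 2, 1] * 5))  # (25, 9)
```

In an age-scale, by contrast, no such table would be needed, since each item would itself be assigned directly to an age level.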

The meaning of the terms mental age and scholastic or
attainment age will be explained in more detail later. For the present
it is sufficient to know that mental age is applied to scores on a
test of "intelligence" of the Binet type; scholastic or attainment
ages correspond to scores on tests of scholastic attainment. To
illustrate methods of standardisation, the discussion will be
restricted to tests of intelligence. This will avoid tiresome repetition.

HOW ARE STANDARDISED TESTS STANDARDISED?

A standardised test is, essentially, a scale of mental or scholastic
measurement graduated in convenient units of intellectual or
scholastic development.

Standardisation involves several processes, e.g.:
(a) discovering by actual trial the relative difficulty and
discriminating power of each test item;
(b) arranging the test items in an order of relative difficulty;
(c) devising convenient units of mental or scholastic development
and attainment;
(d) arranging the units to form a reliable scale.

In actual practice, standardising a modern test to an acceptable
degree of reliability is a complicated and highly-skilled process
for which special training and statistical knowledge are essential.
However, we are interested here primarily in the general principles
underlying the construction and use of these tests. The principles
themselves are not really difficult to understand.

A Statistical Digression.

It is desirable to digress at this point to consider three common
statistical terms:

(a) Arithmetic Mean or Average.

Table I illustrates a mark-list for a class of 20 pupils in a test
with a maximum mark of 20. The marks gained by each pupil
A, B, C, D, etc. are shown in Column I.

TABLE I

To illustrate calculation of arithmetic mean and standard
deviation.

        Col. I  Col. II    Col. III           Col. I  Col. II    Col. III
Pupil   Mark    Deviation  Deviation   Pupil  Mark    Deviation  Deviation
                from Av.   Squared                    from Av.   Squared

A       17      +7         49          L      16      +6         36
B       12      +2          4          M      14      +4         16
C        3      -7         49          N      15      +5         25
D        6      -4         16          O       4      -6         36
E        9      -1          1          P      10       0          0
F       10       0          0          Q      13      +3          9
G       11      +1          1          R      10       0          0
H        7      -3          9          S      10       0          0
J        9      -1          1          T       5      -5         25
K        8      -2          4          U      11      +1          1

                                       Totals 200                282
The average or arithmetic mean is the sum of all the marks
divided by the number of cases. In the example shown, the total
marks gained by all candidates amount to 200. There are 20
candidates. Thus the average mark for this class in this test is
200 ÷ 20, i.e. 10.

The average represents what may be called the central tendency
or general level of the group. If there were a bigger proportion of
clever pupils in the class, the average would be accordingly higher;
if more dull pupils, the average would be lower.

(b) Standard Deviation.

However, in considering the distribution of marks within a group
it is not sufficient to know the average or central tendency only.
We need to know also the extent to which the marks are spread
or scattered along the mark scale. The usual measure of spread
or scatter of a distribution is the standard deviation.

This is calculated as follows:

The deviation or difference of each score from the average is
computed (See Column II, Table I);
Each deviation is then squared (Column III);
The squared deviations are added together;
The sum of the squared deviations is divided by the number of
cases to get the mean squared deviation.
Finally we find the square root of the mean squared deviation
which gives the required standard deviation.

For those who prefer a formula:

$$\text{Standard deviation} = \sqrt{\frac{\text{sum of squared deviations from the mean}}{\text{number of cases}}}$$

In our example:

$$\text{Standard deviation} = \sqrt{\frac{282}{20}} = \sqrt{14.1} = 3.75$$
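For readers who wish to check the arithmetic, the calculation can be sketched in a few lines of Python. This is an illustration only: the function name `mean_and_sd` is our own, and the marks for pupils G and H (11 and 7) are taken from their printed deviations of +1 and -3 from the average of 10.

```python
import math

# Marks from Column I of Table I (20 pupils, maximum mark 20).
# The marks 11 and 7 (pupils G and H) are recovered from their deviations.
marks = [17, 12, 3, 6, 9, 10, 11, 7, 9, 8,
         16, 14, 15, 4, 10, 13, 10, 10, 5, 11]


def mean_and_sd(values):
    """Return (mean, standard deviation), dividing by the number of cases."""
    n = len(values)
    mean = sum(values) / n
    # Column II is (value - mean); Column III is its square.
    sum_of_squares = sum((v - mean) ** 2 for v in values)
    return mean, math.sqrt(sum_of_squares / n)


mean, sd = mean_and_sd(marks)
sum_sq = sum((m - mean) ** 2 for m in marks)
print(mean, sum_sq, round(sd, 2))
```

The sum of squares is divided by the number of cases (20), as in the text, and not by one less than that number as some statistical packages do by default.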





Effect on Mark Lists of Differences in Examiners' Averages
and Standard Deviations.

We can now understand more clearly how differences can arise
between different examiners marking the same batch of scripts
(See Fig. I).

Examiners differ with respect to their customary average mark;
some are more generous, others more exacting. They also differ in
the spread of the marks along the mark scale between full marks
and no marks. In some cases, candidates are bunched close together
in the upper, middle, or lower ranges of the mark scale. In others,
the marks are splashed quite freely throughout the whole range.
These tendencies are found to be characteristic of individual
examiners and teachers, almost as characteristic as their signatures.
Obviously then, the idiosyncrasy of an examiner may be a matter
of crucial importance in the case of selection for secondary schools,
or when distinctions and failures are in question. Candidates whose
scripts are marked by an examiner who has a generous average,
and who bunches his marks close together in the upper range of
the mark list, will have a decided and, probably, quite unmerited
advantage.

In competitive examinations, therefore, it is imperative that all
examiners' mark lists shall be transformed to one standard mark
scale with a constant average and constant scatter (standard
deviation). Both statistics must be controlled at the same time since
even if two examiners' average marks are the same, their scatter
may be widely different.
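One common way of making such a transformation is to express each mark as a deviation from the examiner's own average, in units of his own standard deviation, and then to re-express it on the chosen standard scale. The sketch below assumes a standard scale with average 50 and standard deviation 15; these figures, the function name `rescale` and the two mark lists are hypothetical and are not taken from any particular examining body.

```python
import math


def rescale(marks, target_mean=50.0, target_sd=15.0):
    """Transform a mark list to a scale with the given average and scatter."""
    n = len(marks)
    mean = sum(marks) / n
    sd = math.sqrt(sum((m - mean) ** 2 for m in marks) / n)
    if sd == 0:
        raise ValueError("all marks are identical; the scatter cannot be rescaled")
    # Deviation in standard-deviation units, moved on to the target scale.
    return [target_mean + target_sd * (m - mean) / sd for m in marks]


# Two hypothetical examiners marking the same five scripts.
generous = [70, 75, 80, 85, 90]   # high average, marks bunched together
exacting = [20, 35, 50, 65, 80]   # low average, marks widely spread

print([round(x, 1) for x in rescale(generous)])
print([round(x, 1) for x in rescale(exacting)])
```

Because the two examiners place the scripts in the same order and at the same relative intervals, both lists come out identical after scaling, although the raw marks differ greatly.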

(c) The Normal Distribution of Scores.

This need for one standard mark scale into which the marks of
individual examiners may be transformed brings us to the
conception of a normal distribution.
The question arises immediately, which particular standard scale
shall be adopted for the purpose of scaling marks in examinations
and tests? Will any standard scale be satisfactory, or is there one
with special advantages?

This question has been answered in practice, by noting how
"intelligence" or scholastic attainments are actually distributed
in a large unselected population of pupils. By "unselected" is
meant that no bias has been used in choosing members of the
group; all possible variations are represented in due proportion.

Many hundreds of thousands of pupils have been tested in
various countries for "intelligence" and scholastic attainment. In
all cases, where the samples tested have been large, and in the
absence of special bias in selection, it has been found that the
results of the tests arrange themselves not quite exactly in a normal
distribution, but in very close approximation to it.

The exact details of a normal distribution need not concern us
here. They are described and illustrated in any primer of statistics.*
It is important for our purpose to note that when large, unselected
populations of pupils are given the same tests, their marks, and by
implication their "intelligence" and attainments, do approximate
closely to this type of distribution. Further, when the average and
standard deviation of a normal distribution are known, the
proportions of candidates in a normally distributed population which
may be expected to appear within any part of the mark scale can
be calculated. This is important since it implies that the known
characteristics of the normal type of distribution can be used in
order to check the trials of a standardised test for bias and other
irregularities.

* See for example, a clear and simple exposition in Glassey and Weeks'
The Educational Development of Children.
For example, if an "intelligence" test is given to a sufficiently
large and unselected population, and if the average I.Q. is 100,
and the standard deviation of the I.Q. scores is 15, we may expect
the percentages of cases in Column I of Table II at each I.Q. level.
If the average is 100 and standard deviation 16.5 instead of 15,
the percentages will approximate to those given in Column II.

Conversely, by comparing any distribution of the marks with a
normal distribution having the same mean and standard deviation
we can estimate the likelihood that the marking is biased or the
population specially selected.



TABLE II

Showing the percentage of cases which may be expected at given levels
of Intelligence Quotient if the distribution is normal and the standard
deviation known. The average in both cases is 100.

                I                          II
I.Q. Level      Proportion to be expected  Proportion to be expected
                if standard deviation      if standard deviation
                is 15                      is 16.5

131 and over
121 to 130
111 to 120
101 to 110
91 to 100
81 to 90
71 to 80
70 and less

Note the difference in the proportions due to increase in the extent of
the scatter or dispersion. The larger the standard deviation the higher the
proportions of cases at either end of the distribution.
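Proportions of this kind can be computed directly from the normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). It rests on two assumptions of ours: that the top level begins at 131, and that the dividing points lie half-way between adjacent whole-number levels (70.5, 80.5 and so on). The figures it prints are therefore computed values and not a transcription of the printed table.

```python
from statistics import NormalDist


def expected_percentages(mean, sd, cuts):
    """Percentage of a normal population in each band between the cut points.

    `cuts` must be in ascending order; the first band is everything below
    the first cut and the last band everything above the last cut.
    """
    dist = NormalDist(mu=mean, sigma=sd)
    cdf = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return [100 * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]


# Assumed dividing points, half-way between the I.Q. levels.
cuts = [70.5, 80.5, 90.5, 100.5, 110.5, 120.5, 130.5]
labels = ["70 and less", "71 to 80", "81 to 90", "91 to 100",
          "101 to 110", "111 to 120", "121 to 130", "131 and over"]

col_1 = expected_percentages(100, 15, cuts)     # standard deviation 15
col_2 = expected_percentages(100, 16.5, cuts)   # standard deviation 16.5

# Print from the highest level downwards.
for label, p1, p2 in reversed(list(zip(labels, col_1, col_2))):
    print(f"{label:>12}  {p1:5.1f}  {p2:5.1f}")
```

The larger standard deviation gives higher percentages in the two extreme bands and lower percentages near the average.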






Thus, the normal distribution is accepted by most
test-constructors as a convenient basis for an ideal or standard mark-scale.
Psychologists believe that "intelligence" and scholastic attainments
are, in fact, normally distributed.* Therefore, most standardised
tests are deliberately arranged so that if administered to large
unselected populations, e.g. all the pupils in a county area, the
results expressed as intelligence or attainment quotients will fall
into a normal distribution. If at the first trials the distribution is
not normal the test is readjusted until the scores do approximate
sufficiently closely to the normal distribution.

* This belief has never been absolutely proved. It is a convention accepted
as a working hypothesis based upon circumstantial evidence.

Standardising a Test for a Wide Age Range.

The best-known test of this type is the Stanford-Binet Scale for
measuring "intelligence."

This test was first devised by a French psychologist, Binet, at the
beginning of this century. After his death Burt in London and
Terman at the Stanford University in California extended and
improved the test. Terman and Merrill issued a revised and
restandardised American Edition in 1937.†
The main principles of test-construction and standardisation
can be followed by reference to this Binet Scale.



Item Analysis.

In the first place a large number of short questions is collected,
several times as many as will be needed in the finished test.

Since the Binet test is intended for an age range from 2 years
to 18 years the list of problems must be arranged in an order of
increasing difficulty to correspond with the growth of mental
capacity and experience from infancy to maturity.

In addition to arranging the test-items in an ascending order of
difficulty, it is also necessary to find out whether they separate the
various age-levels sufficiently clearly and by approximately regular
intervals. It is obvious that if a test item is answered by as many
four-year-olds as five-year-olds it will not discriminate sufficiently
clearly between the five-year-old and four-year-old levels of
"intelligence."

The process of discovering the degree of difficulty and the power
of each test item to discriminate between levels of ability is known
as item analysis. It is an essential feature of all test-construction.
It is in this respect, as well as in using a much larger number of
questions, that the standardised test differs most significantly from
the orthodox essay or "quiz" type of examination.

How can these qualities of the test items be discovered?

The questions are tried out by experiment on a large and
representative population of pupils characteristic of the region in
which the test will afterwards be used. The percentages of pupils
at each age who answer each test item correctly are tabulated.
This process is illustrated in Table III. The degree of difficulty of
any test item is estimated by the percentage of pupils who answer
it correctly. If 100% give correct answers the item is too easy for
that age-group; if none answer it correctly it is too difficult. By
taking averages along the rows of percentages a total order of
difficulty can be estimated (see last column, Table III). These
averages from item 1 downwards should decrease by approximately
equal amounts. In particular, there should be no reversals in the
descending order.

† See Terman L.M. and Merrill M.A., Measuring Intelligence. Harrap.

The first trial of a list of test-items usually reveals various
anomalies. Some items are ambiguously worded. Others are not
suitable for children of that particular area.* These items are
reworded or discarded. Some items will be incorrectly placed in
the order of difficulty. It is quite impossible to discover how difficult
a particular item is until it has been tried out on a sample of
pupils. Reversals in the order of difficulty may be removed by
rearranging the order of test items or by suitable rewording.

TABLE III

The numbers in this table are hypothetical, for purposes of
illustration only.

Test                        Ages
Item    5    6    7    8    9    10   11   12   Average

1       50   70   90   99   99   100  100  100  88.5
2       45   65   85   95   97   98   100  100  85.6
3       35   55   75   90   95   97   98   100  80.6
4       25   40   60   85   90   93   95   99   73.4
5       15   25   45   70   80   85   90   95   63.2
6       5    15   30   55   70   80   85   90   53.7
7       0    5    15   40   60   75   80   85   45.0
etc.
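The row averages, and the check for reversals in the order of difficulty, can be sketched as follows. The function names are our own. Three rows of hypothetical percentages in the style of Table III are used, and the second of them ("item X") has been deliberately misplaced so that the check has something to find.

```python
def row_averages(table):
    """Average percentage correct across all age-groups, for each item."""
    return {item: sum(pcts) / len(pcts) for item, pcts in table.items()}


def reversals(averages, order):
    """Adjacent pairs in which the later item proved easier than the earlier."""
    return [(a, b) for a, b in zip(order, order[1:])
            if averages[b] > averages[a]]


# Percentages answering correctly at ages 5 to 12 (hypothetical figures).
trial = {
    "item 1": [50, 70, 90, 99, 99, 100, 100, 100],
    "item X": [0, 5, 15, 40, 60, 75, 80, 85],    # placed too early in the list
    "item 6": [5, 15, 30, 55, 70, 80, 85, 90],
}
order = ["item 1", "item X", "item 6"]

averages = row_averages(trial)
print(averages)
print(reversals(averages, order))
```

An item reported by `reversals` would be moved to its proper place in the order, reworded, or discarded, as described above.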



Graduating the Scale. What Units Should be Used? 

Binet's aim was to construct a scale which would measure in
readily understandable units whether any pupil was educationally
backward, normal, or advanced for his age. He decided therefore
to graduate his scale in units of achievement-for-age. Thus, if a
child of 10 years could pass the test items answered correctly by
the average child of 12 years then he might be said to have a mental
(or educational) age of 12. He would be two years of mental
development ahead of his own chronological age. If the ten-year-old
pupil passed the tests only up to those correctly answered by the
average ten-year-old he would be a normal or average pupil for
his age. If he succeeded only up to the tests passed by the average
seven-year-old pupil his mental age would be seven and he would
be, therefore, three years retarded.

* e.g. it would be useless to ask English children questions about dollars,
"quarters", dimes, and cents.
This notion of a standard of achievement corresponding to an
average age, and a scale in units of yearly progress, is easy to
understand and convenient to use. It had been familiar to teachers
in elementary schools long before Binet adopted it for his scale.
It was implied by the Revised Code for Elementary Education
issued by the Committee of Council in 1862. In that year elementary
schoolteachers were, in modern factory terminology, put on to
piece-work and paid by results. At once it became necessary
to devise a method of measuring the results. This was done by
dividing the period of school life from 7 years to the leaving age
into yearly increments called "standards." Then for each standard
a syllabus in the three R's was prescribed and rigidly maintained.
Pupils were transferred from the Infants departments to the upper
school at age 7 and the normal pupil was promoted to a higher
standard once per year.

Thus each year of chronological age corresponded throughout
the country to one year of scholastic attainment. On this basis,
by comparing the chronological age of a pupil with his "standard,"
his scholastic status could be assessed immediately. Thus the
normal 9-to-10 year old pupil would be in Standard III. If he were
in Standard I he would be retarded scholastically by two standards
or years; if in Standard IV he would be one standard unit ahead.
Moreover, since the standard syllabuses were prescribed by the
central authority in Whitehall for the whole country, it was easy
to compare results in different schools. Also, when a pupil moved
from one school to another it was equally easy to allocate him to
the standard appropriate for his age and ability.

This notion of achievement for age is still used in official
documents relating to educational subnormality. The Ministry of
Education Pamphlet No. 5 states (p. 19):

"No child should be classed as educationally sub-normal unless
he is retarded in school work, but some agreement should be
arrived at on the degree of retardation that would justify special
educational treatment ... It is suggested that a large body of
opinion would be found to favour giving special educational
treatment if a child is so retarded that his standard of work is
below that achieved by average children 20% younger than he
is ... All degrees of retardation may be found among educable
children from the minimum indicated above to a maximum which
may be as much as 50% where the child can do school work only
as difficult as that done by average children half his age."*

In principle, the Binet scale is the same as the old elementary
school standard scale. However, the great merit of the Binet scale
lies in the fact that the "standards" of average achievement-for-age
are determined not by what H.M. Inspectors deem it desirable
for pupils to achieve, but by what is actually achieved by a
representative sample of pupils at a specified age. The levels of
difficulty are fixed for the modern standardised test by experiment,
not by inspection.

* Italics mine. 20% younger is equivalent to a mental ratio of 80/100,
i.e. 80 I.Q. The maximum retardation above is 50/100 or 50 I.Q. in
terms of the Binet-type scales.

Having adopted these units of achievement-for-age it now
remains to calibrate the mental scale correctly. This aspect of the
problem has aroused more than a little controversy.

For an achievement-for-age scale to work satisfactorily in
practice it must be calibrated in such a way that the average or
normal pupil will always score a mental age equal to his
chronological age, e.g. the average seven-year-old will achieve a mental
age of VII, and similarly for each age throughout the range
covered by the test.



FIGURE II

To illustrate the convention used for scaling Mental Ages
corresponding to Chronological Ages.

Chronological    Mental               Order of
Ages             Ages                 Test Items

2 years          II yrs.
2½ years         II yrs., 6 mths.     Test Items for 3rd (IIIrd) year.
3 years          III yrs.
3½ years         III yrs., 6 mths.    Test Items for 4th (IVth) year.
4 years          IV yrs.
4½ years         IV yrs., 6 mths.     Test Items for 5th (Vth) year.
5 years          V yrs.
5½ years         V yrs., 6 mths.      Test Items for 6th (VIth) year.
6 years          VI yrs.
The problem now is to allocate each test item found to be
satisfactory by the item analysis correctly to its appropriate year.
Burt's solution of this problem is as follows† (the argument
can be followed by reference to Figure II). We have a chronological
age-scale given by the calendar ages of the pupils. We need a
corresponding mental age-scale made up of the test-items arranged
in corresponding years of mental development or achievement. It is
convenient to use Arabic numerals for chronological age and
Roman numerals for mental age.

The pupils are grouped according to their ages at last birthday.
Thus, in a large representative sample arranged in order of ages
the average 4½-year-old pupil will be mid-way in the group who
are 4-but-not-yet-5 years, and so on for the other age-groups.

Now imagine that all the test items have actually been allocated
correctly on the mental-age scale. The items for year V must be
suitable for the pupils in the group 4-but-not-yet-5 years in order
to give the average child of exactly 5 years a mental age of V.
Such a child if correctly measured should pass all the questions
appropriate for the Vth year. This may seem somewhat confusing
but it resembles the convention we use in calling the years from
1700 to 1799 the 18th century.

Then it follows that to place the average 4½-year-old pupil
correctly at IV years 6 months on the mental age-scale,
approximately half the 4-but-not-yet-5-year-old group must pass all the
tests for the Vth year.

Hence in standardising a mental age-scale of the Binet type a
test item is correctly allocated to a given mental age if it is passed
by 50% of the group nominally one year below that age. E.g. a
test item will be correctly placed for year V if it is passed by 50%
of the 4-but-not-yet-5-year-old group.

Perhaps a clearer example of this principle is given in a report
on the standardisation of a graded word-reading test by P. E.
Vernon.* The word "threaten" was read correctly by the
following percentages of pupils:

Age   6 but   7 but   8 but   9 but    10 but   11 but   12 but
      not 7   not 8   not 9   not 10   not 11   not 12   not 13

%     6       14      29      49       67       84       95

It was, therefore, assigned to a reading age of IX years 6 months.
To be assigned a reading age of X years exactly a word would
need to be read by 50% of the pupils aged 9½ but not yet 10½, and
so on.
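The rule applied to "threaten" can be sketched as follows: find the age-group in which the proportion of correct answers is nearest to 50%, and assign the item to the mid-point of that group. This is a simplification of ours for illustration; the function name `allocate_reading_age` is hypothetical, and a working standardisation would also consider groups cut at other points (such as 9½ to 10½) where no whole-year group lies close to 50%.

```python
# Percentage of each age-group reading "threaten" correctly, keyed by
# age at last birthday (9 means "9 but not yet 10").
threaten = {6: 6, 7: 14, 8: 29, 9: 49, 10: 67, 11: 84, 12: 95}


def allocate_reading_age(pass_rates, target=50):
    """Return (years, months) for the group whose pass rate is nearest target.

    The average member of the group "N but not yet N + 1" is N years
    6 months old, so the item is placed at the mid-point of that group.
    """
    group = min(pass_rates, key=lambda age: abs(pass_rates[age] - target))
    return group, 6


years, months = allocate_reading_age(threaten)
print(years, "years", months, "months")
```

For these figures the nearest group is 9-but-not-10 (49%), which gives the reading age of IX years 6 months quoted above.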

Thus, on the basis of the item-analysis unsatisfactory test items
are discarded, others are reworded, and the list is rearranged until
the test as a whole gives an average pupil a mental age equal to
his chronological age.

† See Mental and Scholastic Tests, p. 152.

* See The Standardisation of a Graded Word Reading Test, p. 14.

The lay public does not realise sufficiently the immense amount
of work involved in calibrating a reliable mental scale of the Binet
type. The latest revision of the Stanford-Binet scale required the
full time of a team of research workers for nine years. The
test-items were re-arranged and re-tested no fewer than six times in
the course of the standardisation.



Standardising a Point-Scale. 

The Binet scale was standardised directly in terms of units of
mental age. In some cases however, e.g. in attainment tests in
Arithmetic, English, Reading, Spelling, it is more convenient to
assign marks (or points) for each correct answer. The total marks
gained are arranged in age-groups, e.g. 6-but-not-yet-7, etc. The
average mark for each group is calculated. Then a graph is drawn
on which the average mark scored by the 6-but-not-yet-7 group
is made to correspond with an attainment age of VI years 6 months,
and similarly for all the other age-groups covered by the test.*
Then by means of the graph intermediate attainment ages can be
read off corresponding to any given score.
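Reading an attainment age from such a graph amounts to interpolating between the plotted points. The sketch below joins the points by straight lines, which is an assumption of ours since a smoothed curve may be drawn in practice. The group averages are invented for illustration, the function name `attainment_age` is hypothetical, and the average mark is assumed to rise from each age-group to the next.

```python
def attainment_age(score, norms):
    """Read off an attainment age (in years) for a given score.

    `norms` is a list of (attainment age, average mark) pairs, one for
    each age-group. Scores beyond either end of the graph are given the
    age at that end, since the graph tells us nothing further.
    """
    points = sorted(norms, key=lambda p: p[1])
    if score <= points[0][1]:
        return points[0][0]
    if score >= points[-1][1]:
        return points[-1][0]
    for (age_lo, mark_lo), (age_hi, mark_hi) in zip(points, points[1:]):
        if mark_lo <= score <= mark_hi:
            fraction = (score - mark_lo) / (mark_hi - mark_lo)
            return age_lo + fraction * (age_hi - age_lo)


# Hypothetical averages: the 6-but-not-yet-7 group (attainment age 6.5)
# averages 10 marks, the 7-but-not-yet-8 group averages 18, and so on.
norms = [(6.5, 10), (7.5, 18), (8.5, 26), (9.5, 32)]

print(attainment_age(22, norms))
```

A score of 22 lies half-way between the averages of the 7.5 and 8.5 groups, and so corresponds to an attainment age of VIII years exactly.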



Standardising a Test for a Restricted Age Range.

The use of tests of the Binet type revealed certain difficulties
implicit in the definition of mental age, particularly at the upper
limits of the scale and in the case of very bright pupils, for whom
the test items are not sufficiently difficult. Thus, since the scale
stops at a nominal average mental age of XVIII it is possible for
very bright pupils of 14 years, bright pupils of 16 years and average
pupils of 18 years all to score the same mental age although they
cannot be equal in mental capacity.

Again there is evidence which suggests that the rate of
development of "intelligence," like that of height, slows down and finally
ceases at some time after adolescence. Thus it is possible for an
average adult of 18 years to score as high a mental age on the
Binet scale as an average adult of 36 years. However, in terms of
achievement-for-age we have no reason to suppose that the
18-year-old is twice as intelligent as the 36-year-old. They may quite easily
be equally intelligent. Thus, the Binet Scale ceases to give reliable
indications at its upper age-limits.

* See Schonell, Diagnostic and Attainment Testing. In actual practice
the process is more complicated in detail. The scatter of the ages and
marks at each age-grade must be checked by inspection of the
standard deviations in order to estimate the possibility of any serious
departures from a normal distribution which would suggest errors
in selecting the population of pupils used for standardisation.

In the second place, a test intended to cover a wide age-range
can include only a relatively few test items at each age level. The
latest revision of the Stanford-Binet scale includes 17 test items
for year XI and 18 for year XII. Of these, 11 items must be passed
successfully at each age level if the testee is to score the
corresponding mental age. As against this, a Moray House group test
of intelligence standardised for the chronological age-group
consisting mainly of children between 10 years 6 months and 11 years
5 months includes 100 test items. Thus the restricted age-limit test
is more sensitive than the Binet scale to very small differences in
"intelligence" or attainment. The difference may be compared
with that between one ruler graduated in half-inches and another
in sixteenths.

The need for tests having greater sensitivity to small differences
within a group of approximately the same chronological age has
been emphasised by the demand for standardised tests for use in
selecting pupils of age 11-plus for grammar schools. For this
reason, and on account of the anomalies in the definition of mental
age at the upper end of the Binet scale, psychologists have
concentrated on standardising group tests for restricted age-limits. These
tests require modifications in the methods of standardisation.

Omitting statistical technicalities which do not affect the main 
principles, the process goes as follows — 

(i) more questions are assembled than will be needed for the
final test. As many as four times the required number may be
collected for an "intelligence" test.

(ii) the first draft is given to a trial sample of some 150 to 200
pupils adequately representative of the age-range in question.

(iii) the percentage of the group passing each item correctly
indicates, as before, the relative difficulty of the item. This being
known, all items answered correctly by less than 25% or more
than 85% of the group are discarded.* It has been found by
repeated experiment that better results are obtained by including
more items of a moderate degree of difficulty at the expense of
items which are very easy or very difficult.

(iv) the final draft is made up from the items now remaining
in such a way that the average difficulty value of all the included
items shall be 50%.

(v) as before, it is necessary that the test shall discriminate
sufficiently between the different levels of ability and spread the
candidates evenly along the scale. This is particularly important
if the test is to be used for purposes of selection for grammar
schools. To measure power of discrimination, the test constructors
proceed as follows. The scripts are arranged in descending order
of total marks. They are then divided into equal batches, six for
example. For each test item, the number of pupils in each batch
giving the correct answer is tabulated and set out as follows:

Batches arranged in          1       2      3      4      5      6
descending order of scores   (top)                               (bottom)

No. of pupils in each
batch answering
  Test Item 1                20      18     10     8      3      1
  Test Item 2                15      19     17     14     12     12
  Test Item 3

In the case illustrated, almost as many pupils in the lowest batch
have answered item 2 as in the top batch. This item, therefore,
does not discriminate sufficiently between the bright and dull
pupils.

* This is the effective answer to critics who object to the tests on the
ground that no child at the age specified could possibly answer the
questions. Actually, it has been found by trial that some pupils can,
and do, answer correctly.

By the use of a suitable formula, an index of discrimination for
each item can be calculated. All items below a value found by
experience to give the best results in practice are then discarded.
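The two screening steps, by level of difficulty and by power of discrimination, can be sketched as follows. The pamphlet does not say which formula is used for the index, so the one shown here (the proportion answering correctly in the top batch, less the proportion in the bottom batch) is merely a simple stand-in of our own. The batch size of 25 is likewise an assumption, corresponding to a trial sample of 150 pupils in six batches, and the function names are hypothetical.

```python
def keep_by_difficulty(percent_correct, low=25, high=85):
    """Discard items answered correctly by fewer than `low`% or more than `high`%."""
    return {item: p for item, p in percent_correct.items() if low <= p <= high}


def discrimination_index(batch_counts, batch_size):
    """Proportion correct in the top batch minus that in the bottom batch.

    `batch_counts` lists the number answering correctly in each batch,
    from the top batch down to the bottom batch.
    """
    return (batch_counts[0] - batch_counts[-1]) / batch_size


# Numbers answering correctly in each of six batches, top to bottom.
item_1 = [20, 18, 10, 8, 3, 1]
item_2 = [15, 19, 17, 14, 12, 12]
batch_size = 25   # assumed: 150 pupils divided into six equal batches

print(discrimination_index(item_1, batch_size))
print(discrimination_index(item_2, batch_size))
print(keep_by_difficulty({"a": 10, "b": 50, "c": 90}))
```

On this index item 1 separates the top and bottom batches sharply, while item 2 scarcely separates them at all and would be a candidate for discarding.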

A sufficient number of the original test-items found on the first
trial to be most satisfactory with respect to level of difficulty and
power of discrimination are then given a second trial, this time
on a much larger sample, usually the whole 11-plus age-group
in a county. Several thousand pupils may be used for this trial.

Great care must be taken to ensure that the final sample is
as nearly as possible representative, in due proportion, of all the
various grades and conditions in the pupil population for which
the test is intended. From this representative sample, the final
adjustments are made and the "mental scale" calculated. Suitable
allowances are made for differences in chronological age within the
limits prescribed by the test.

Finally, since the notions of mental age and intelligence quotient
have now passed into general circulation, and are readily
understood by teachers, the scores on these restricted-age-limit tests are
transformed into "intelligence quotients."*

* See later, p. 29.

Standardising the Instructions and Methods of Marking.

When the test items have been analysed and the scale calibrated
the test constructors prescribe precise instructions for giving the
test and marking the answers. This is to ensure that the tests will
be given and marked in the same way as that used in the process
of standardisation itself. These instructions are issued in
handbooks to accompany the tests. The appropriate handbook should
always be studied carefully before a test is given, since the accuracy
of the results depends on the test being used in the same way
and in the same circumstances as those in which it was originally
standardised.



MENTAL AGES, ATTAINMENT AGES AND QUOTIENTS



Hitherto we have used the term "mental age" to describe the
units in which the standardised scale is graduated. In practice, as
was stated above, mental age is kept for scores on tests of
"intelligence" or general educational capacity. These tests purport to
measure innate mental capacity in contradistinction to attainment tests
which are measures of scholastic attainment in some particular
subject. For the latter we use the term "attainment" age. The
principles of standardisation are the same.

The mental or attainment age represents the level of mental
development or scholastic achievement reached by a pupil of a
given chronological age. From these two measures it is possible
to compute a mental or attainment quotient (or ratio). This is
done by dividing mental or attainment age by chronological age
and multiplying by 100 (to remove fractions).
Thus:

$$\text{Intelligence Quotient} = \frac{\text{Mental Age}}{\text{Chronological Age}} \times 100$$

$$\text{Attainment Quotient} = \frac{\text{Attainment Age}}{\text{Chronological Age}} \times 100$$



The quotient is a measure of rate of development. If a 10-year-old
pupil has a mental age of XII years his I.Q. is 120. That indicates
that he has developed as far in 10 years as the normal or average
pupil does in 12 years. A 10-year-old pupil with the mental age
of VII has an I.Q. of 70. He has developed in 10 years only as
far as the average pupil of 7 years. Similarly for attainment
quotients.
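
The arithmetic of the two worked cases can be checked with a short sketch. The function name `quotient` is chosen here for illustration only; the calculation is simply the ratio defined above.

```python
def quotient(age_on_scale, chronological_age):
    """Mental (or attainment) age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    # Multiply first so that whole-number ages give an exact result.
    return 100 * age_on_scale / chronological_age


# The two pupils described in the text, both aged 10.
print(quotient(12, 10))  # mental age XII
print(quotient(7, 10))   # mental age VII
```

The same function serves for attainment quotients if an attainment age is passed in place of the mental age.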

If the rate of mental development is constant it follows that the
I.Q. or mental ratio might be used as an indication of probable
future progress. Whether or not the rate of development, and
therefore the I.Q., is in fact constant is controversial. Most
psychologists nowadays believe that it is approximately constant over a
range of some two to three years, particularly during the primary
school period, i.e. 7 or 8 to 11 years. However, predictions with
respect to future mental development based on one test only must
be accepted with caution as being no more than approximate. The
mental age-scale, like all measuring instruments, is liable to errors
of measurement. Also the course of mental development may be
upset by radical changes of environment, emotional disturbances
and other factors. For ordinary practical purposes we need to know
whether a pupil's I.Q. falls below 90 or is within the ranges 90 to
110, 111 to 120, or above 120, since it is comparatively rare for a
pupil's I.Q. to change from below 90 to above 110. If such a case
is found it is most probable that some serious error was made in
the original measurement.



BACKWARDNESS AS DISTINCT FROM EDUCATIONAL RETARDATION.

As we noted above, a most important question nowadays is, to
what extent is a pupil's attainment age commensurate with his
mental age? In other words, is he working scholastically up to the
level of his educational capacity? For example, a pupil of 11 may
have a reading age of VIII years. However, if his mental age is
only VIII years that pupil is working up to the limit of his capacity.
He is backward but not retarded. On the other hand, if a pupil of
11 years has a mental age of XIII but, as sometimes happens, an
arithmetic age of X years only, then that pupil is seriously retarded
in Arithmetic but he is not backward in the sense of being dull.
The distinction between retarded, and backward because dull, is
most important in educational diagnosis and treatment and it can
be made with confidence only on the evidence of well-standardised
and reliable tests of "intelligence" and scholastic attainment. For
diagnostic purposes the standardised tests are far superior to school
marks and much more reliable than mere personal impressions.
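
The distinction can be set out as two separate comparisons. The sketch below is an illustration only: the function name and the use of bare "less than" comparisons, with no allowance for errors of measurement, are assumptions made here and not a rule laid down by any test handbook.

```python
def diagnose(chronological_age, mental_age, attainment_age):
    """Return (dull, retarded) for one pupil and one subject.

    dull     -- mental age is below chronological age
    retarded -- attainment age is below mental age, i.e. the pupil
                is not working up to his capacity in the subject
    Assumption: plain comparisons, with no margin for test error.
    """
    dull = mental_age < chronological_age
    retarded = attainment_age < mental_age
    return dull, retarded


# Pupil of 11, mental age VIII, reading age VIII.
print(diagnose(11, 8, 8))
# Pupil of 11, mental age XIII, arithmetic age X.
print(diagnose(11, 13, 10))
```

In practice a small gap between the ages would not justify either label, since every age is itself subject to errors of measurement.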
WHAT DO STANDARDISED TESTS TEST?



VALIDITY AND RELIABILITY 


Much of the suspicion with which standardised tests have been 
regarded by the general public has arisen from ignorance of the 
ways in which the tests have been constructed, and the purposes 
for which they are intended by their constructors. 

From the previous section it should be clear that the tests
indicate the standing of any given pupil relative to that of a
hypothetical average pupil of his own chronological age. This
follows from the way in which the tests are standardised, no
matter whether they cover extended or restricted age-limits.
Moreover, because the trial population in which the tests are
standardised has been chosen deliberately and carefully to be representative
of many classes, many schools, all grades of educable ability or
scholastic attainment, each test is a sort of "common yard-stick"
by which the work of different teachers and different schools can
be fairly compared.

The disadvantage of the traditional non-standardised examinations
was, precisely, that each teacher, or each inspector, set the
questions according to his own personal subjective estimate of their
difficulty for children of the type and age in question, and each
one marked the answers according to his own methods and his
own personal standards of value. Repeated surveys have shown
beyond doubt that different examination papers intended for
children of the same type and age have differed significantly in
difficulty, and that different examiners' methods of marking and
standards of mark value have varied enormously. Cases are
reported in which an identical essay type of answer has been
awarded all grades of merit from distinction to failure, by different
examiners all supposed to be competent for the purpose. The case
of the non-standardised examination is very similar to that in which
the yard measure was taken to be the length of any individual
draper's arm. On the other hand the standardised test may be
compared with the standard yard used by all drapers.

We have still to ask, however, to what extent a standardised test 
is valid. In other words, does it really test what it purports to test? 

In the case of attainment tests this question is easily answered.
A well constructed and standardised test of vocabulary, reading
comprehension, English grammar or mechanical arithmetic is most
likely to test, predominantly, attainment in vocabulary, reading
comprehension, etc. A vocabulary test is not likely to measure
attainment in mechanical arithmetic or vice versa. It is
comparatively easy to construct a valid test of attainment in some scholastic
subject or other because we know without much doubt what it is
we are attempting to measure.

On the other hand, in the case of "intelligence" the problem is
by no means simple. That is why, in this discussion, "intelligence"
has been qualified by inverted commas.

The reason is, as in the case of many other terms in general use,
e.g. "morality," "fair shares," "justice," that although
"intelligence" has passed into common usage and although we imagine
we know what it means, in actual fact nobody has succeeded yet
in giving it an agreed general definition. This implies, of course,
that the word is a "portmanteau" term which includes not one
meaning but a complex set of different meanings not sorted out
accurately in common usage.

It is in connection with the validity of intelligence tests that most
of the controversies about standardised tests have arisen. What
does an "intelligence" test actually test? We must find a
reasonable answer to this puzzle. Otherwise we shall be in somewhat the
same case as an individual who tries to measure pints of milk
with a foot rule.

To establish the validity of an "intelligence" test we must first
agree upon an independent criterion of intelligent behaviour and
then compare the results of the tests with the criterion.

Consciously or otherwise, the independent criterion which has
been used to check the validity of "intelligence" tests has been,
in the last analysis, ability to succeed in school or college work as
measured by the reports of competent observers familiar with the
sample of pupils and students in question, together with final
scholastic achievement.

For many years, follow-up surveys have been made in which a
sufficiently large sample of pupils has been tested at a given age,
e.g. seven or eleven years. Their scholastic achievement and
academic behaviour have been recorded over a period of years
and the resulting order of merit compared with the original test
results.

Because "intelligence" is so vague a term and because the
really effective independent criterion of "intelligence" has been
in fact subsequent progress in school or college, most of the
commonly-called tests of general intelligence are really measures
of educability (or general educational aptitude) and should be
considered as such.

However, although the ultimate criterion for establishing the
validity of "intelligence" tests has been educability, at the same
time the analysis of test-items has clearly indicated which types of
test-items are most closely correlated with, or saturated with,
"intelligence" as measured by the tests. Generally speaking,
test-items which require the application of knowledge as distinct from
mere memorisation; reasoning; the appreciation of nice distinctions
between the meanings of words; ingenuity, etc., are most highly
saturated with whatever constitutes general "intelligence." It is
significant that although those psychologists who have been most
influential in devising tests of "intelligence" have differed widely
in their theoretical definitions they have all used the same types of
test-items for their practical tests.

It is important that this question of validity should be kept in
mind. When some journalist interested mainly in causing a sensation
picks out a particular test-item for ridicule he is apt to forget that
if the test in question has been reliably standardised, all the
test-items in it have been carefully analysed by experiment. It is not
possible for anybody to judge by mere inspection whether a given
test-item is a valid test of educability. These matters can be settled
only by an appeal to the results of practical experiment. In the
historical development of tests it has been found that some
test-items which were confidently expected to be highly saturated with
"intelligence" had, in fact, little significant connection with it,
while others which the test-constructors had at first regarded with
suspicion were found to be highly indicative of educability.

These considerations need to be kept in mind when the lay 
public is invited to ridicule some test-item lifted from its context 
in some unspecified test. 



Reliability. 

A properly constructed test must not only be valid; it must also 
be reliable. 

Reliability, in this context, means that if the same test is given
to the same pupils on two or more occasions at intervals of time
it will give closely similar results even if administered and marked
by different people. Reliability is a measure of the trustworthiness
of the test results.

Degree of reliability is indicated by a fraction. Complete
reliability, e.g. if a test gives identical results on two or more occasions,
would be indicated by 1.0. A coefficient of reliability of 0.5
indicates that the test results are little, if any, better than
guesswork. Test constructors publish reliability coefficients for
standardised tests, and intending users should avoid any tests which
have not a guaranteed high reliability coefficient. For routine school
purposes reliability coefficients should be above 0.90 and tests for
grammar school selection should have reliability coefficients not
less than 0.95. To quote from a recent catalogue, four tests listed
therein, two of "intelligence" and two of attainment, have
guaranteed reliability coefficients of 0.979; 0.952; 0.971; 0.976.*
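
The text does not say how the coefficient is calculated. One common way, assumed here for illustration, is to take the product-moment correlation between the scores of the same pupils on two occasions. The scores below are invented, and the function name is hypothetical.

```python
import math


def reliability_coefficient(first, second):
    """Product-moment correlation between two sets of scores.

    Assumption: test-retest reliability is estimated as the
    correlation between the first and second testing.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists of at least 2 scores")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    covariance = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    spread_1 = sum((a - mean_1) ** 2 for a in first)
    spread_2 = sum((b - mean_2) ** 2 for b in second)
    return covariance / math.sqrt(spread_1 * spread_2)


# Invented scores for six pupils on two occasions.
occasion_1 = [95, 102, 110, 88, 120, 105]
occasion_2 = [97, 100, 112, 90, 118, 104]
print(round(reliability_coefficient(occasion_1, occasion_2), 3))
```

Identical results on both occasions give a coefficient of 1.0, which corresponds to the "complete reliability" described above.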



CONCERNING THE CHOICE OF TESTS 



We have now described how standardised tests are constructed;
how the constructed tests are tested; and what, in fact, they
measure.† It has seemed expedient to do this in some detail on
account of the persistent and occasionally perverse criticisms and
misunderstandings of these tests. Test-items are taken out of their
context; no attention is paid to the way in which the items have
been chosen, tested, and then combined within the test as a whole.

From these notes we can infer certain precautions which must
be taken in choosing and using standardised tests. The following
considerations must be kept in mind:

(a) only tests of a guaranteed high degree of reliability should be
used;

(b) the tests must be given strictly according to the instructions
for administration and marking. If a test is time-limited to a
prescribed number of minutes then those limits should be
accurately kept. If a stop-watch is prescribed then a stop-watch
should be used;

(c) the tests should be used (i) only for the purpose for which they
have been standardised, (ii) only for the age-limits over which
they have been standardised and (iii) only for pupils in
conditions similar to those on whom the test was originally
standardised.

* Knowing the reliability coefficient of a test, statisticians can estimate
the probable limits of accuracy of its results (provided, of course, that
the test is given and marked strictly according to the prescribed
instructions). For example, if a test has a reliability coefficient of 0.975
the chances are 68 in 100 that a quotient measured by it on a second
occasion will not differ from the first by more than 2.37 points above
or below; and 95 in 100 that the two results will not differ by more than
4.74 points above or below.
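
The footnote does not state how these limits are obtained. The sketch below assumes the usual standard error of measurement, SD × √(1 − r), and a quotient scale with a standard deviation of 15; under those two assumptions the figures 2.37 and 4.74 are reproduced. The function name is chosen for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SD * sqrt(1 - r): assumed formula, not stated in the footnote."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Assumption: quotients are on a scale with standard deviation 15.
sem = standard_error_of_measurement(15, 0.975)
print(round(sem, 2))      # limit for about 68 chances in 100
print(round(2 * sem, 2))  # limit for about 95 chances in 100
```

A lower reliability coefficient widens the limits quickly, which is one reason for the high coefficients demanded of selection tests.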

† The discussion has been restricted, for obvious reasons, to tests of
educability and scholastic attainment. Numerous tests for specialised
abilities are now available for use in vocational selection and guidance
but the principles of construction and validation are the same as those
described above.

No competent psychologist will claim that standardised tests are
absolutely accurate measuring instruments. No competent
psychologist will make a dogmatic prediction about future scholastic
progress on the basis of a single testing. Nevertheless, if used
according to the principles stated above there is no doubt that the
indications for educational guidance to be gained from a battery
of well-chosen standardised tests are much more reliable than the
results of an academic examination of the traditional type and
much more useful for diagnostic purposes than impressions based
on observations merely.

A lively controversy has arisen about the possible influence of
practice in answering tests on the intelligence quotients of children,
and about the effects of coaching.

There is no doubt that practice does raise test scores. It has
been found, however, that increments of I.Q. due to practice in
answering test-items diminish rapidly and that most pupils are
"saturated" with practice after three or four trials. Some people
seem to believe that this practice effect is a sufficient reason for
abolishing the use of standardised tests. A more intelligent attitude
would seem to be that all children should be tested at intervals
during their primary school careers.


COMPARING RESULTS ON DIFFERENT TESTS. EFFECT
ON QUOTIENTS OF METHODS OF STANDARDISING
AND GRADUATING TESTS



As we have seen, the principle of a standardised scale of capacity
or attainment is by no means new. Moreover, it is easy to
understand and apply in practice, and the terms "mental age,"
"attainment age," "intelligence quotient" have by now passed into
general usage.

However, the use of standardised tests for school record purposes
and for grammar school selection may lead to comparisons between
I.Q's or attainment quotients derived from different types of tests.
It seems desirable, therefore, to point out at this stage that quotients
are not absolute quantities. They are affected by the type of test
used and by methods of standardisation.

This fact need not lead to any difficulty in practice provided that
the purpose of the tests is clearly understood and that each test is
used strictly for the purpose for which it was devised and
standardised.

A Binet-type test is really a measure of intellectual or scholastic
development. It indicates in units of years of mental or scholastic
progress the level of development reached by the pupil tested when
compared with the progress which a hypothetical average pupil in
the same circumstances can be expected to make. The object of
any extended age-range test is to reveal mental or scholastic age
and by doing so to indicate how far in advance or in arrear is the
pupil tested. Knowing this, the skilled teacher can estimate what
exercises and methods are most likely to be educationally
satisfactory for the pupil in question. For purposes of diagnosis, or of
sorting out pupils within the same school for teaching purposes,
it does not matter how many of the group may be approximately
equal in development. Information about the level of development,
i.e. mental or scholastic age, is the most important consideration.

On the other hand, when all pupils between say 10½ years and
11½ years in an administrative county are to be tested in order to
select a small proportion for admission to grammar schools, the
authorities are not interested primarily in mental age and levels of
development. They want to know which pupils in this 11-plus
group are, intellectually and scholastically, the most able. In a
group consisting of all the pupils in an administrative county, it is
certain that a relatively large proportion will tend to cluster near
the average mark of the group. Now, for the purposes of selection
for grammar schools the administrators do not like the doubtful
border-line cases where it is difficult to decide fairly, on objective
evidence, between candidate A and candidate B. Therefore, in
constructing and standardising a grammar-school selection test the
psychologists aim deliberately at a form of test which will separate
out the candidates of equal chronological age as far as possible
along the scale. They are interested mainly in selection, not in
educational guidance and remedial treatment.

Thus, mental or attainment age does not enter into the
standardisation of a grammar-school selection test. Instead, each child
in the group is assessed by measuring his standing in a
representative group of pupils all exactly the same chronological age as
himself. The degree of "intelligence" and attainment within the
group is measured not in years of development but in units of the
standard deviation of the distribution of the scores made by a
representative group in the test. The same applies to all
restricted-age-range tests.

This principle of measuring can be followed by reference to the
data in Table I. The standard deviation of the distribution of marks
of the 20 pupils exemplified there is 3.75; the average mark is 10.
Consider pupil A. His mark is 17. His deviation above the average
is 7 marks. In units of standard deviation this is 7 ÷ 3.75, i.e. 1.87
units. Pupils with 14 marks will be just over 1 standard deviation
unit above the group average. Pupils F and I? are level with the
group average. Their deviation from the average on this scale is 0.
Pupil C with 3 marks is 7 ÷ 3.75, i.e. 1.87 standard deviation units
below the group average.
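
The conversion of a mark into units of standard deviation can be sketched as follows, using only the average (10) and standard deviation (3.75) quoted from Table I. The function name is chosen here for illustration.

```python
def standard_deviation_units(mark, group_mean, group_sd):
    """Deviation of a mark from the group average, in units of SD."""
    if group_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mark - group_mean) / group_sd


GROUP_MEAN = 10    # average mark quoted from Table I
GROUP_SD = 3.75    # standard deviation quoted from Table I

for mark in (17, 14, 10, 3):
    units = standard_deviation_units(mark, GROUP_MEAN, GROUP_SD)
    print(mark, round(units, 2))
```

Marks above the average give positive units and marks below it give negative units, so pupil C appears as minus 1.87.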

Thus it is possible to construct a scale graduated in units of
standard deviation of the distribution of scores. In this way the
relative standing in aptitude or attainment of individuals in a group
of children all of the same chronological age can be measured and
compared.

However, to the non-statistician, the notion of an intelligence
quotient is more familiar than units of standard deviation.
Therefore, for the greater convenience of education authorities the
constructors of restricted-age-limit tests transform their standardised
scores into "I.Q's" in such a way that the distribution of these
I.Q's resembles that of the Binet-type tests. This can be done
because, as we noted above, scores on "intelligence" and
attainment tests made by large unbiassed samples of pupils tend to fall
into a close approximation to normal distribution. Also, when the
arithmetic mean and standard deviation of a normal distribution
are known, the percentages of cases which may be expected to lie
between the various intervals of the distribution can be calculated.
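
Both steps can be sketched briefly: turning a standing in standard deviation units into a quotient, and finding the percentage of cases expected between two quotients on a normal distribution. The mean of 100 and standard deviation of 15 used as defaults are illustrative parameters, and the function names are hypothetical; any particular test's handbook gives its own conversion.

```python
import math


def deviation_quotient(sd_units, mean=100.0, sd=15.0):
    """Quotient on a scale with the chosen mean and SD (assumed values)."""
    return mean + sd * sd_units


def normal_cdf(z):
    """Proportion of a normal distribution lying below z SD units."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percent_between(low, high, mean=100.0, sd=15.0):
    """Percentage of cases expected between two quotients."""
    upper = normal_cdf((high - mean) / sd)
    lower = normal_cdf((low - mean) / sd)
    return 100.0 * (upper - lower)


print(deviation_quotient(1.0))
print(round(percent_between(85, 115), 1))
print(round(percent_between(90, 110), 1))
```

Changing the `sd` argument shows how the same standing yields different quotients on differently graduated scales, the difference growing with distance from the average.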

Thus, because the revised versions of the Binet scale give
distributions of I.Q's with an average of 100 and standard deviation of
approximately 16.5, the transformed I.Q's of the
restricted-age-limit tests are roughly, but only roughly, comparable with those of
the latest revision of the Binet scale.* Inspection of the data in
Table II will give some idea of the differences in the two I.Q.
distributions.

This explanation of the effect on I.Q's and attainment quotients
of different types of standardised tests and different methods of
standardisation is necessary because it is often supposed,
erroneously, that I.Q's are absolute quantities, and that the same child
will score the same I.Q. no matter what type of test is used to
measure it. Then people with this wrong idea may make
illegitimate comparisons between I.Q's for the same pupil on two different
types of tests and condemn the tests for not giving identical results.
It would be just as absurd to cast doubts on the process of
measuring height because the same man's height is 2 yards or 1.825
metres. The yards and metres represent scales graduated in different
units.

For practical purposes the differences discussed above mean, not
that standardised tests are unreliable, but that in choosing tests
sufficient care must be taken to select tests which have been
standardised for the purpose for which they are required. Thus one
would not use an extended-age-range test in order to select pupils
of 11 years for grammar school entrance; or use a
restricted-age-range test to establish the mental or attainment age, and, by
implication, the degree of retardation of a pupil whose school work is
unsatisfactory.

* The average I.Q's in both scales will be 100. The effects of different
units of measurement will increase as the I.Q's get farther away from
the average, either way.

It follows from the above that additional precautions must be
taken. They are (a) if it is necessary to compare the standing of
several pupils for any purpose whatever then they must all have
the same test; and (b) whenever a mental or attainment age, or
quotient, is recorded the name of the tests used to ascertain it
should always be stated at the same time. Then there can be no
more ambiguity about the statement than there is about the
statement that X's height is 2 yards or 1.825 metres.

To make comparisons easier some test constructors nowadays
are standardising all their tests on the same scale, e.g. Moray House
tests have a mean I.Q. of 100 and a standard deviation of 15.



PROBLEMS OF A BILINGUAL EDUCATIONAL POLICY
FOR WALES



In 1953 in a report entitled "The Place of Welsh and English
in the Schools of Wales" the Central Advisory Council for
Education (Wales) recommended the adoption of a bilingual
educational policy. By a bilingual policy the Council meant that the
English-speaking population of Wales should acquire as satisfactory
a control of the Welsh language as most Welsh-speaking children
have of English.

This recommendation has been accepted by fourteen of the
seventeen Local Education Authorities in Wales. The problem is,
of course, how to implement it most efficiently in practice.

It would appear that standardised tests of attainment and
educable capacity could be used with advantage in this connection.

1. For Survey Purposes.

If the policy of bilingual education is to be taken seriously, then
periodic language surveys will be necessary to provide reliable
estimates of the attainments of pupils of varying ages in the Welsh and
English languages. This is the only way of finding how much, if
any, progress is being made.

Up to the time of writing, official estimates of these attainments
have been based on very dubious statistics. The Advisory Council's
own survey is a case in point. In stating the need for such a survey
the Report says "The only attempts to make surveys in the past
have been those of individual authorities in respect of their own
schools . . . and taken as a whole they have suffered from the fact
that there were no uniform criteria and no agreed standards or
methods of interpreting the data not only as between one authority
and another but even in the same authority at different periods."
In other words, data from different areas and different observers
are useless for comparative purposes unless estimated by means
of a standard scale.

Unfortunately, the Council's own statistics suffer from precisely
the same defect. All the teachers in Wales in 1950 were asked for
the following information with respect to (i) pupils whose first
language is Welsh; (ii) pupils whose first language is English:

A. Number of children having no knowledge of the second
language.

B. Number of children who can understand but not speak the
second language.

C. Number of children who can understand simple lessons in the
second language in such subjects as History, Geography or
Nature Study and can conduct elementary conversations in
the second language.

D. Number of children who can express themselves with fair
fluency in the second language.

In a language survey conducted by the Welsh Joint Education 
Committee in 1961 the same criteria were used. 

These criteria are, from a statistical point of view, shockingly
vague. For example, what is meant, exactly, by no knowledge?
What is meant by understanding but not speaking the second
language? How much understanding is necessary to entitle a
particular pupil to inclusion? What is meant by "understanding simple
lessons"? How simple must a lesson be for this purpose? At what
age is understanding 'simple' in this connection; for example, is
'simple' at age seven equivalent to 'simple' at age eleven? What
is 'elementary' conversation at any given age level? What is a
'fair fluency' and is any fair fluency in Welsh equivalent to a fair
fluency in English?

Vague standards such as these will be decided by the subjective
estimates of men and women teachers of different ages in different
grades and different types of schools; teachers, moreover, with
varying language backgrounds and levels of competence in Welsh
and English from high efficiency to little or none at all. Again,
attitudes toward Welsh or English will constitute distorting factors
in judgment. Those anxious to make a good show for one or other
language will include as many as can be crowded into categories
C and D. Those who are openly or secretly opposed to the bilingual
[...] or impossibly 'idealistic' and include as many as possible in
categories A and B. Moreover, there is evidence for supposing
that people show a tendency to be either predominantly 'includers'
or 'excluders' when they are required to classify and this
constitutes a temperamental difference of which the classifiers are quite
unaware.

It will be suggested, of course, that by and large these errors of
judgment will cancel each other out. In the case of purely chance
errors that may be correct. However, errors of judgment arising
from the sources we have indicated above are not chance errors.
They are systematic errors and they will be cancelled out only if
the people with positive sets of attitudes are equal in numbers,
prestige, and intensity of attitude to those with negative sets.
Concerning this important qualification there is no evidence
whatever. In addition, cancelling out differences in the process of
averaging may hide differences which are vital to an adequate
understanding of the situation in question. We still await reliable
data with respect to the distribution of English and Welsh speech
both geographically, and in terms of grades of educable capacity,
attainment and linguistic background. Such data can be provided
by standardised attainment tests in vocabulary, oral reading and
comprehension.
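
The difference between chance errors and systematic errors can be seen in a small numerical sketch. All the figures are invented for illustration: a "true" count of 40 pupils in some category, and teachers whose estimates err by the amounts shown.

```python
from statistics import mean

TRUE_COUNT = 40  # invented figure: pupils who really belong in a category


def average_estimate(errors):
    """Average of the teachers' estimates, given each teacher's error."""
    return mean([TRUE_COUNT + error for error in errors])


# Chance errors: as likely to fall above as below, so they cancel.
chance_errors = [-3, -1, 0, 1, 3]

# Systematic errors: 'includers' overestimate, 'excluders' underestimate.
balanced = [4, 4, 4, -4, -4, -4]   # equal numbers and intensity
unbalanced = [4, 4, 4, 4, 4, -4]   # includers predominate

print(average_estimate(chance_errors))
print(average_estimate(balanced))
print(average_estimate(unbalanced))
```

Only the first two averages return the true count; where one attitude predominates the average is shifted, and nothing in the returns themselves reveals by how much.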



2. For Estimating Degrees of Educable Capacity. 

To implement the bilingual policy the schools must teach both
languages. This raises the question of educable capacity in relation
to learning two languages simultaneously and teaching them. The
Report states the problem thus: "Having due regard to the varied
abilities and aptitudes of the pupils and to the varied linguistic
patterns in which they live, the children of the whole of Wales and
Monmouthshire should be taught Welsh and English according to
their ability to profit by such instruction.^ This recommendation
implies that consideration should be given to the desirability of
teaching only their mother tongue to children with relevant physical
or mental disabilities and to children whose lack of ability together
with a poor supporting linguistic background makes the learning of
a second language whether English or Welsh acutely burdensome."

This puts some obvious difficulties of a bilingual policy quite
bluntly. One may ask, then, how the relevant disabilities can be
diagnosed and measured, and by whom? At what level, precisely,
of mental capacity and linguistic ability shall the line be drawn,
and who will decide which pupils shall not be taught a second
language?

The Council suggests that to impose a uniform aim on all pupils 
must be undesirable for at least two reasons: — 

^ Italics mine. 

(a) the standards would be too low for the abler pupils; this
would lead to boredom and frustration.

(b) the standards would be too high for the weaker pupils,
resulting in failure and loss of confidence with the consequent
aversion from further learning.

The main consideration must
be the relation between a pupil's aptitude, ability and
linguistic background, and the level of attainment to be expected.
As for who shall decide these matters, the answer proposed
is, the teachers. As usual, they are left holding the baby.



How are the teachers to decide? To implement the Advisory
Council's suggestions, they must discover in the case of each pupil:

(i) general educable capacity;

(ii) linguistic capacity (not by any means identical with (i));

(iii) present level of attainment in English and Welsh;

(iv) the linguistic background in the neighbourhood;

(v) what levels of attainment and rate of learning are appropriate
having due regard to items (i) and (iv) above.

Again, for greater efficiency it is desirable to decide (1) which,
if any, of several alternative methods of teaching the first and second
language is most effective for pupils of a given level of linguistic
and general capacity and attainment; (2) which is the best way
of organising schools and classes in cases where a selective
examination is used for grammar-school entrance; (3) what level of
attainment is to be expected or demanded from pupils with different
levels of capacity and ability and different linguistic backgrounds.

It would appear obvious, therefore, that if the bilingual policy is
to be administratively and educationally efficient, tests of educable
capacity and linguistic attainment are needed which have been
correctly standardised on sufficiently large and representative
populations of children in Wales according to their linguistic
background (unless, of course, everybody prefers to muddle along as
usual on platitudes and sentiment).

This may appear to be a formidable task. However, for purposes
of guidance to administrators and teachers it would be sufficient
to conduct periodical tests in sample schools in representative areas
varying from predominantly Welsh to predominantly English, both
urban and rural.

This forces upon our attention certain special problems involved
in constructing and using standardised tests in a mixed language
area such as Wales is at present.

* See, for example, the survey undertaken at the request of the Welsh
Joint Education Committee by W. R. Jones, J. R. Morrison, J. Rogers,
H. Saer on "The Educational Attainment of Bilingual Children in
Relation to their Intelligence and Linguistic Background." University of
Wales Press 1957. Relevant to our discussion is the authors' warning that
their findings must be viewed within the limits inevitably imposed on
their work by the lack of suitable and satisfactory test material in Welsh.



CONSTRUCTING AND USING STANDARDISED TESTS
IN A MIXED LANGUAGE AREA



In Wales the linguistic background is mixed in various degrees
from almost or quite monoglot English to almost monoglot Welsh.
This raises problems in the production and use of standardised
tests of which educators and investigators have not always been
explicitly aware.

A glance through the earlier published accounts of investigations
into bilingualism will show that the experiments were made on two
supposedly homogeneous linguistic groups: one monoglot English,
the other 'bilingual.' But no attempts were made to discover to
what extent the supposedly bilingual group was, in fact, bilingual.
Generalisations about the effects of bilingualism or about methods
of dealing with supposedly bilingual groups based upon these
experiments could be quite misleading. Moreover, until the
researches of W. R. Jones on the possible influences of socio-
economic status it was not realised that such status was correlated
positively with linguistic background and educational attainment.*
In any discussions about bilingualism in relation to educable
capacity and scholastic attainment, both linguistic background and
socio-economic status of the pupils need to be estimated and
allowed for.

In constructing and using standardised tests in Wales two
fundamental principles must always be kept clearly in mind. They are:

(a) Standardised tests can only be used, reliably, on populations
equivalent to that on which the tests were originally
standardised;

(b) the population used for purposes of standardisation should
vary significantly only with respect to the variable which that
test is supposed to measure, e.g. educable capacity, or linguistic
aptitude or linguistic attainment.

With regard to (a), from the account of the standardising process
given above it is obvious that a test standardised on an English
population may be not at all suitable for use in Wales, not even
for testing monoglot English-speaking pupils. Its suitability and

* W. R. Jones. "Bilingualism and Intelligence." University of Wales
Press. 1959. The author states:

"It appears, for example, that occupations in the professional, salaried
and non-manual categories predominate in the English and mixed-
English groups whereas, by contrast, the observed frequencies of such
occupations fall distinctly below the expected frequencies in the case
of the Welsh group." p. 42.

validity cannot be taken for granted until the case has been proved
by a re-standardisation which will check the achievement-for-age
norms of the population in question.

Mere translations of standardised English tests into Welsh will
not give reliable results until an adequate item-analysis has been
made on a representative Welsh group. There is no guarantee
whatever that the difficulty values and discrimination values of the
Welsh translations will be equivalent to those of the original English
versions. Neither difficulty nor discrimination values can be
estimated by inspection. They can be found only by actual trial
on a representative population. Moreover, tests standardised on
predominantly urban populations, particularly in Wales, will not
give reliable norms if used subsequently on rural populations. The
background conditions are not the same in the two cases.
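
To make the idea of an item-analysis concrete, the following is a minimal sketch under stated assumptions. It takes the difficulty value of an item to be the proportion of pupils answering it correctly, and the discrimination value to be the difference in that proportion between the highest-scoring and lowest-scoring 27 per cent of pupils. These are common conventions and are not necessarily the definitions used elsewhere in this pamphlet; the trial data and the function name are invented for illustration.

```python
def item_analysis(responses, fraction=0.27):
    """responses: one list of 0/1 item scores per pupil.

    Returns a (difficulty, discrimination) pair for each item.
    """
    n_pupils = len(responses)
    n_items = len(responses[0])
    # Rank pupils by total score, best first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(n_pupils * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    results = []
    for j in range(n_items):
        # Proportion of all pupils who passed item j.
        difficulty = sum(r[j] for r in responses) / n_pupils
        # Pass rate in the upper group minus pass rate in the lower group.
        discrimination = (sum(r[j] for r in upper)
                          - sum(r[j] for r in lower)) / k
        results.append((difficulty, discrimination))
    return results

# Invented trial: ten pupils, three items (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 0],
    [1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0],
]
results = item_analysis(responses)
for number, (difficulty, discrimination) in enumerate(results, start=1):
    print(f"item {number}: difficulty {difficulty:.2f}, "
          f"discrimination {discrimination:.2f}")
```

Running the same calculation on the English original and on the Welsh translation, each with its own representative group, is what shows whether the two versions of an item behave alike.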

With regard to principle (b) above, the reason is not, at first
sight, so obvious. Suppose we wish to standardise a test in problem
arithmetic. The pupils selected for the purpose must be capable of
reading the words in which the problems are stated sufficiently well
to understand clearly what the problems are about, and sufficiently
rapidly to be able to complete a time-limited test within the limits
allowed.* If they cannot do so then even if they are familiar with
the appropriate mechanical calculations they cannot solve the
problems because they will not understand clearly what the
problems are. Unless there is in the tested group a minimum ability
to read which is approximately the same for all members of the
group the test will measure an indeterminate mixture of attainments
in both reading comprehension and arithmetic. It will measure
neither accurately.

Similarly in the construction and use of tests of "intelligence."

It has been demonstrated again and again that success on a
standardised verbal "intelligence" test depends on linguistic
ability, on social status and home background. This is not difficult
to understand. Whatever "intelligence" may be, it can only reveal
itself through the medium of acquired experience. If a child is
deprived of the common experience of its native culture then
whatever innate powers may be involved in intelligent behaviour will
find only inadequate means of expression. The items in the original
Binet tests of general intelligence were chosen and worded on the
supposition that they could be answered by means of activities and
experiences which a child in that cultural background might have
been expected to "pick up" in the ordinary course of living at
home, or in school, or at play. In other words, it was taken for
granted that in every respect except the aptitudes which are involved

* If, indeed, there is to be a time limit. It has been shown that rural
children are seriously handicapped in time-limited tests. See Morton
and Butcher. Brit. Jnl. Ed. Psych. XXXIII, 1, Feb. 1963, p. 22.

in intelligent behaviour all the children to be tested started equal.
This supposition must not be taken for granted.

Before any standardised test can be used reliably for purposes of
comparison there must be some guarantee that all the individuals
in the population tested have had equal opportunities both through
in-school teaching and out-of-school experience to develop the
aptitudes or acquire the attainments which it is the purpose of the
test to detect and measure.

Thus, in testing bilingual populations in a mixed language area
some means of estimating degrees of bilingual background and
socio-economic status must be available both during the processes
of standardisation and subsequently in testing bilingual populations.
It has been shown, for example, in the construction of a "Welsh
Linguistic Background Scale"* that there was in a mixed language
population used for the investigation a coefficient of correlation
between Welsh Linguistic Background scores and scores on a Welsh
Vocabulary Test of 0.85, which indicates a close relation between
the two; the richer the Welsh background of the pupil the higher
the score on the Vocabulary test.
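
For readers unfamiliar with the coefficient of correlation, the sketch below computes the usual product-moment (Pearson) coefficient for two sets of scores. It assumes that this is the coefficient reported in the investigation, which is not stated here; the six pairs of scores are invented for illustration and are not data from that study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented scores for six pupils.
background = [10, 25, 40, 55, 70, 85]   # linguistic background scale
vocabulary = [12, 20, 35, 30, 52, 60]   # Welsh vocabulary test

r = pearson_r(background, vocabulary)
print(round(r, 2))
```

A coefficient near 1 means that pupils who stand high on one measure almost always stand high on the other; a coefficient near 0 means that the two measures are unrelated.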

These results have an obvious bearing on the problem of con- 
structing standardised tests in Welsh, in a mixed language area. 

In the first place, the population is not homogeneous with respect
to linguistic background and attainment in Welsh. Not all the pupils
have equal opportunities for acquiring either Welsh or English
speech. Differences in linguistic background "saturation" must be
estimated and allowed for in determining the norms of attainment
which may reasonably be expected of any given "bilingual" pupil.

Put in another form the question is: What is, in fact, a
representative sample of Welsh pupils with respect to attainment in Welsh
or English? Suppose, for example, we wish to construct and
standardise a test of educable capacity ("intelligence"), verbal or
non-verbal, or of attainment in some subject involving language. What
sort of sample should be used for estimating the attainment-levels
which may be expected of a hypothetically average child at a given
chronological age? If a sample is selected containing all grades of
language mixture from mainly Welsh to mainly English, then those
norms must not be used, strictly speaking, to test the attainments
of other samples unless the proportions of language mixture are
comparable. The same principle applies to socio-economic status.
Moreover, the norms so obtained would be useless for practical
guidance. As an indication of the levels of attainment-for-age and
rates of progress which may reasonably be expected of pupils
learning English and Welsh concurrently, norms based on the
average of all degrees of language mixture will be misleading since
they will effectively mask the varying influence of different degrees

* M. E. Gwenda Rees. "A Welsh Linguistic Background Scale."
Aberystwyth Collegiate Faculty of Education. Pamphlet No. 2. 1954.

of Welsh and English background saturation and of socio-economic
status on attainment in either language. To set the same targets of
attainment-for-age in Welsh for pupils with all degrees of Welsh
linguistic background will be absurd, as the Advisory Council said
in its report. Norms for the standardised tests must be computed
for various ranges of linguistic background scores. Further, when
tests standardised in this way are to be used to measure educable
capacity or scholastic attainment the linguistic background scores
of the pupils to be tested must first be ascertained.
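
The procedure just described can be sketched in outline. The sketch assumes whole-number linguistic background scores from 0 to 100, divided into three bands at arbitrary cut-off points, and expresses each pupil's result on a scale with mean 100 and standard deviation 15 within the pupil's own band. The cut-offs, the scale, the function names and the data are all hypothetical; a real standardisation would require a far larger sample in every band.

```python
from statistics import mean, pstdev

# Hypothetical bands of linguistic background score (whole numbers 0-100).
BANDS = [
    (0, 33, "mainly English"),
    (34, 66, "mixed"),
    (67, 100, "mainly Welsh"),
]

def band_of(background_score):
    """Return the label of the band containing this background score."""
    for low, high, label in BANDS:
        if low <= background_score <= high:
            return label
    raise ValueError(f"background score out of range: {background_score}")

def norms_by_band(sample):
    """sample: (background score, test score) pairs from the standardising group.

    Returns {band label: (mean, standard deviation)} of the test scores.
    """
    grouped = {}
    for background_score, test_score in sample:
        grouped.setdefault(band_of(background_score), []).append(test_score)
    return {label: (mean(v), pstdev(v)) for label, v in grouped.items()}

def standard_score(background_score, test_score, norms):
    """Test score on a mean-100, SD-15 scale relative to the pupil's own band."""
    band_mean, band_sd = norms[band_of(background_score)]
    return 100 + 15 * (test_score - band_mean) / band_sd

# Invented standardising sample for a Welsh attainment test.
sample = [(10, 20), (20, 30), (30, 40),
          (40, 35), (50, 45), (60, 55),
          (70, 50), (80, 60), (90, 70)]
norms = norms_by_band(sample)

# The same raw score of 45 from two pupils with different backgrounds.
print(round(standard_score(20, 45, norms)))
print(round(standard_score(80, 45, norms)))
```

The same raw score is well above expectation for a pupil from a mainly English background and well below it for a pupil from a mainly Welsh background, which is precisely the difference that a single pooled norm would conceal.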

These problems are not insoluble. What it amounts to is that a
good deal of preliminary research needs to be done by competent
research workers to provide the basic data required for further
investigations. A start has already been made. The following tests
in Welsh are already available:



Rees, M. E. Gwenda. A Welsh Linguistic Background Scale.
Collegiate Faculty of Education, Aberystwyth. Pamphlet 
No. 2. 1954. 

A Linguistic Background Scale 

prepared by members of the Collegiate Faculty of Education,
University College of North Wales, Bangor. See T.
R. Miles "Bilingualism in Caernarvonshire" published
by the University of Wales Press.

A Linguistic Background Questionnaire (1^0) 

prepared by W. R. Jones and J. R. Morrison for the 
National Foundation for Educational Research. Form N.S. 
74 B. 

Profion Dealltwriaeth Di-iaith.

A Welsh adaptation by W. R. Jones of Jenkins' Scale of 
Non-Verbal Mental Ability (with a Manual of Instruction 
in Welsh), Age-range 10-12, published by the National 
Foundation for Educational Research, for the Collegiate 
Faculty of Education, University College of North Wales, 
Bangor. 

Cotswold Mental Ability Test (Verbal) for the age-range 10-12
years. Adapted into Welsh by W. R. Jones from the
original test by J. W. Jenkins, published by R. Gibson &
Sons, Queen St., Glasgow, C.1.

Brace, J. L. A Welsh Word-Recognition Test. Collegiate Faculty 
of Education, Aberystwyth. Pamphlet No. 5. 

Emmett, W. G. Dee-side Non-Verbal Reasoning Test No. 1 for
age-range 10-12 years.
This is a non-verbal test in which the instructions are given
in both English and Welsh, prepared for the Collegiate
Faculty of Education, Aberystwyth, in co-operation with
the Carmarthenshire L.E.A. Published by Harrap & Co.
Ltd., High Holborn, London, W.C.1. (A closed test for
use by Local Education Authorities only).

Prawf Cymraeg (C1). An attainment test in Welsh Language Usage
intended for pupils aged 10 years 5 months to 11 years
6 months having Welsh as their first language. (Available
to Local Education Authorities only). Published by the
Collegiate Faculty of Education, University College of
Wales, Aberystwyth.